In this notebook, we will apply a feature importance (FI) pipeline to:
- Load the transformed train CSV file into a Pandas dataframe for analysis.
- Split the dataset into train (75%) and test (25%) subsets.
- Use `category_encoders` to encode the categorical feature using `OneHotEncoder`.
- Use `lightgbm` to train an `LGBMRegressor` regression model using the train subset.
- Produce predictions of wine quality using the test subset.
- Use `shapash` to generate the feature importance plot using `SmartExplainer`.
- For selected features, analyse how each feature affects predictions (feature contribution).
- Recommend any decisions to be applied to feature selection before model building.
Note:
The model training step in the above pipeline has been applied for feature importance analysis only. Proper model training processes, which include trying different algorithms and tuning model hyperparameters, will be conducted in the model building stage and documented in a different notebook.
import pandas as pd
from category_encoders import OneHotEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from IPython.display import display
from shapash import SmartExplainer
# Data source & destination path
in_out = '../data/'
# Load the train CSV file into Pandas dataframe for analysis
train_df = pd.read_csv(f'{in_out}train_transformed.csv')
feature_descriptions = {
"fixed acidity": "Acidity from fixed acids",
"volatile acidity": "Acidity from volatile acids",
"citric acid": "Amount of citric acid",
"pH": "pH level of wine",
"sulphates": "Amount of sulphates",
"alcohol": "Alcohol content",
"quality": "Quality score of wine",
"wine_type_red": "Red wine type",
"wine_type_white": "White wine type",
"wine_type": "Wine type",
"residual_sugar_density_mean": "Residual sugar & density mean",
"chlorides_density_ratio": "Ratio of chlorides to density",
"sulfur_dioxide_mean": "Mean of sulfur dioxide"
}
y_df = train_df['quality'].to_frame()
X_df = train_df[train_df.columns.difference(['quality'])]
train_df.head()
| | fixed acidity | volatile acidity | citric acid | pH | sulphates | alcohol | quality | wine_type | residual_sugar_density_mean | chlorides_density_ratio | sulfur_dioxide_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.943612 | 0.609132 | 0.611565 | -1.011681 | -0.021103 | 0.477217 | 6.0 | white | -0.230202 | -5.309967 | 0.184676 |
| 1 | 0.455066 | 2.106564 | -1.496435 | 1.045176 | -0.101079 | -1.071319 | 5.0 | red | -0.071295 | 0.899096 | -0.451677 |
| 2 | 1.197274 | -0.367374 | 0.203565 | 0.484215 | 1.616734 | 1.131435 | 7.0 | red | -0.273300 | 1.665372 | -1.864449 |
| 3 | -0.974228 | -0.974543 | 0.543565 | 0.234899 | 1.476658 | 1.051930 | 5.0 | white | -0.908618 | 1.102921 | 0.400251 |
| 4 | -0.665612 | -0.367374 | 0.815565 | -0.450720 | -0.344227 | -1.170216 | 5.0 | white | 0.337231 | -1.053460 | 1.077442 |
# Encoding Categorical Features
categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']
encoder = OneHotEncoder(
cols=categorical_features,
handle_unknown='ignore',
return_df=True).fit(X_df)
X_df = encoder.transform(X_df)
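As a rough illustration of what the encoding step produces, here is a minimal sketch using `pandas.get_dummies` as a stand-in for `category_encoders.OneHotEncoder` (both expand the categorical column into indicator columns and leave numeric columns untouched; the toy values below are made up):

```python
import pandas as pd

# Tiny stand-in frame with one categorical column, mirroring wine_type
demo = pd.DataFrame({
    "alcohol": [0.48, -1.07, 1.13],
    "wine_type": ["white", "red", "red"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(demo, columns=["wine_type"])
print(encoded.columns.tolist())
# -> ['alcohol', 'wine_type_red', 'wine_type_white']
```

This is why `wine_type_red` and `wine_type_white` appear in `feature_descriptions` above alongside the original `wine_type` column.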
# Train / Test Split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)
# Model Fitting
regressor = LGBMRegressor(n_estimators=200).fit(Xtrain, ytrain)
y_pred = pd.DataFrame(regressor.predict(Xtest), columns=['pred'], index=Xtest.index)
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000360 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1170
[LightGBM] [Info] Number of data points in the train set: 2577, number of used features: 11
[LightGBM] [Info] Start training from score 5.801319
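Since the model here is only a vehicle for feature importance analysis, no accuracy metric is reported in this notebook. If a quick sanity check on the predictions is wanted, RMSE can be computed directly; the sketch below uses made-up values (in the notebook, `ytest['quality']` and `y_pred['pred']` would be used instead):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for ytest['quality'] and y_pred['pred']
y_true = pd.Series([5.0, 6.0, 7.0, 5.0])
y_hat = pd.Series([5.2, 5.8, 6.5, 5.1])

# Root mean squared error: quality scores span a few points,
# so an RMSE well under 1 point would suggest a usable fit
rmse = float(np.sqrt(((y_true - y_hat) ** 2).mean()))
print(f"RMSE: {rmse:.3f}")
```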
# Declare and Compile SmartExplainer
xpl = SmartExplainer(
model=regressor,
preprocessing=encoder,
features_dict=feature_descriptions
)
xpl.compile(x=Xtest,
y_pred=y_pred,
y_target=ytest,
)
# Display features importance
xpl.plot.features_importance()
Observations
The feature importance output provides valuable insights into the predictive power of each feature in the dataset.
Here's a summary of the observations:
Alcohol content is the most important feature, contributing approximately 34.44% to the model's predictive power. This suggests that the alcohol content in wine significantly influences its quality.
The second most important feature is Acidity from volatile acids, contributing about 15.48%. This indicates that volatile acidity also plays a crucial role in determining wine quality.
Amount of sulphates and Mean of sulfur dioxide contribute around 9.97% and 8.64% respectively, showing their moderate importance in the model.
Features like Residual sugar & density mean, Ratio of chlorides to density, and pH level of wine have similar importance scores ranging from approximately 6% to 7%.
The least important features are Amount of citric acid, Acidity from fixed acids, and Wine type, with importance scores around and less than 5%.
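The percentages quoted above are read off the shapash plot; the same kind of figures can also be derived from LightGBM's raw importances by normalising them to sum to 100%. The sketch below uses made-up gain values (in the notebook, `regressor.feature_importances_` paired with `Xtrain.columns` would supply the real numbers):

```python
import numpy as np

# Hypothetical raw importance gains for three of the features
features = ["alcohol", "volatile acidity", "wine_type"]
raw = np.array([344.4, 154.8, 50.0])

# Normalise the gains to percentage contributions, as shown in the plot
pct = 100 * raw / raw.sum()
for name, p in zip(features, pct):
    print(f"{name}: {p:.2f}%")
```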
Understand how selected features contribute to wine quality
xpl.plot.contribution_plot("alcohol")
There seems to be a very strong positive correlation between alcohol content and wine quality.
xpl.plot.contribution_plot("fixed acidity")
Although the importance of Acidity from fixed acids is around 5%, there still seems to be a moderate negative correlation with wine quality.
xpl.plot.contribution_plot("wine_type")
Although the importance of Wine type is less than 5%, red wine types generally have higher quality scores than white wine types.
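The wine-type observation can be cross-checked directly with a pandas group-by on mean quality per type. The sketch below runs on made-up rows constructed to mirror the pattern seen in the contribution plot (in the notebook, `train_df` itself would be grouped):

```python
import pandas as pd

# Hypothetical rows; in the notebook, train_df would be grouped instead
df = pd.DataFrame({
    "wine_type": ["red", "red", "white", "white", "white"],
    "quality": [7.0, 6.0, 5.0, 6.0, 5.0],
})

# Mean quality score per wine type
mean_quality = df.groupby("wine_type")["quality"].mean()
print(mean_quality)
```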
Recommendation
Based on the above analysis, it's evident that the importance of the features in predicting the wine quality score varies widely: some features have importance values around or above 15%, whereas others sit around or below 5%. It's nevertheless recommended to include all features in the model-building process, since even the features with lower importance, such as Acidity from fixed acids and Wine type, exhibit a moderate relationship with the wine quality score.